Reducing the Impact of Data Sparsity in Statistical Machine Translation

Authors

Karan Singla

Kunal Sachdeva

Srinivas Bangalore

Dipti Misra Sharma

Diksha Yadav

Abstract

Morphologically rich languages generally require large amounts of parallel data to adequately estimate the parameters of a statistical machine translation (SMT) system. However, creating large collections of parallel data is time-consuming and expensive. In this paper, we explore two strategies for circumventing the sparsity caused by the lack of large parallel corpora. First, we explore the use of distributed representations in a recurrent neural network-based language model with different morphological features. Second, we explore the use of lexical resources such as WordNet to overcome the sparsity of content words.

Similar resources

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Under Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use the space character instead. This common incorrect use of the space character leads to some s...

Mitigation of Data Sparsity in Classifier-Based Translation

The concept classifier has been used as a translation unit in speech-to-speech translation systems. However, the sparsity of the training data is the bottleneck of its effectiveness. Here, a new method based on a statistical machine translation system is introduced to mitigate the effects of data sparsity when training classifiers. Also, the effects of the background model which is ...

Coling 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Workshop on Speech Processing for Safety Critical Translation and Pervasive Applications

Orthographic and Morphological Processing for Persian-to-English Statistical Machine Translation

In statistical machine translation, data sparsity is a challenging problem, especially for languages with rich morphology and inconsistent orthography, such as Persian. We show that orthographic preprocessing and morphological segmentation, of Persian verbs in particular, improve Persian-English translation quality by 1.9 BLEU points on a blind test set.

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are at the core of Machine Translation (MT) engines, as these engines are developed through frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

Publication date: 2014
